Balancing trust and safety: Lessons from the CrowdStrike incident

On July 19, 2024, CrowdStrike, one of the largest endpoint security providers, issued an update to Windows hosts globally that sent them into a “crash loop,” resulting in the infamous Blue Screen of Death (BSOD). CrowdStrike explained the issue was caused by a “defect found in a single content update for Windows hosts.” The downstream impact of this bug was one of the largest and broadest outages in history.

When situations like this occur, everyone scrambles, vendors and CISOs alike, to uncover what happened and find a way to fix or contain the impact before threat actors catch on and try to exploit the disruption. While the spotlight is on the vendor to provide a mitigation or fix, companies often can’t wait long: their day-to-day operations are affected, and CIOs and CISOs face enormous pressure whenever the business is impacted.

What exactly happened?

While the story is still developing, a malformed update file appears to have caused CrowdStrike’s kernel-level driver to crash. This hit customers running Windows machines that received the update automatically and caused mass outages globally, affecting industries ranging from healthcare and automotive to airlines and banks.

Typically, vendors release staged or rolling updates, which give companies time to test and to deploy incrementally to a subset of machines, ferreting out any issues before the update affects everyone. However, in this case, that level of due diligence wasn’t available to customers who had pre-selected the “auto-update” function. This raises an important question: Are auto-updates worth it?
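To make the idea of a staged rollout concrete, here is a minimal sketch in Python. It is not how CrowdStrike or any particular vendor does it; the ring sizes (1%, 10%, everyone) and the `deploy` and `is_healthy` callbacks are assumptions chosen for illustration. Each machine is assigned to a ring based on a hash of its ID, and the update only advances to the next ring if every machine in the current ring still reports healthy.

```python
import hashlib

# Hypothetical cumulative ring sizes: 1% canary, then 10%, then everyone.
RING_SHARES = (0.01, 0.10, 1.0)


def assign_ring(host_id, ring_shares=RING_SHARES):
    """Map a host to a rollout ring, deterministically, from a hash of its ID."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as a fraction between 0 and 1.
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    for ring, cumulative_share in enumerate(ring_shares):
        if fraction < cumulative_share:
            return ring
    return len(ring_shares) - 1


def staged_rollout(hosts, deploy, is_healthy, ring_shares=RING_SHARES):
    """Deploy ring by ring; halt as soon as a ring contains an unhealthy host.

    `deploy` and `is_healthy` are caller-supplied callbacks (assumed here).
    Returns (hosts_updated, completed).
    """
    rings = [[] for _ in ring_shares]
    for host in hosts:
        rings[assign_ring(host, ring_shares)].append(host)

    updated = []
    for ring_hosts in rings:
        for host in ring_hosts:
            deploy(host)
            updated.append(host)
        # Gate: do not touch the next ring unless this one is fully healthy.
        if not all(is_healthy(host) for host in ring_hosts):
            return updated, False
    return updated, True
```

With a gate like this, a bad update that crashes the canary ring stops there instead of reaching the remaining machines.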

To automatically update… or not?

There is undoubtedly a great deal of pressure on security vendors, CISOs and CIOs to stay one step ahead of threat actors, especially as AI and emerging technologies have expanded attack surfaces. That pressure leads security leaders to emphasize speed and choose automatic updates for all computers over manual (picking the time), staged (not rolling out to every system at once) or versioned (remaining one or two versions behind, at N-1 or N-2) software updates.
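As an illustration of the “versioned” option, the following Python sketch picks the release a machine should run under an N-1 or N-2 policy. It assumes releases are identified by dotted numeric version strings; the function name and the version numbers in the usage note are hypothetical, not taken from any vendor’s product.

```python
def pick_version(released, lag):
    """Return the release that is `lag` versions behind the newest one.

    lag=0 means latest (N), lag=1 means N-1, lag=2 means N-2.
    If fewer releases exist than the lag asks for, the oldest is returned.
    Assumes dotted numeric version strings such as "7.15".
    """
    if not released:
        raise ValueError("no releases available")
    if lag < 0:
        raise ValueError("lag must be zero or positive")
    # Sort numerically so that "7.10" ranks above "7.9".
    ordered = sorted(released, key=lambda v: tuple(int(p) for p in v.split(".")))
    index = max(len(ordered) - 1 - lag, 0)
    return ordered[index]
```

For example, with hypothetical releases `["7.14", "7.15", "7.16"]`, a lag of 1 selects `"7.15"` and a lag of 2 selects `"7.14"`. A pin like this only protects a machine if it governs the artifact that is actually being updated.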

Automatic updates are encouraged across the industry for heightened security. Still, the reality is that new agent versions very rarely bring a significant leap in the ability to protect and secure the organization. Further, unless companies carefully create policies to stage auto-updates to a subset of “test” machines first, they risk propagating problems to every machine and then having to roll back the deployment (if they can).

While customers can opt into manual or staged updates, many companies also choose auto-updates for ease: IT and security teams don’t have to expend the effort to push updates manually. Unfortunately, that “ease” comes with a tradeoff. Customers can’t test and validate updates before they roll out to all systems.

There is a baked-in fear within security and IT teams that performing manual updates may cause them to miss a critical security update and will require them to stay abreast of every vendor’s updates. Further, manual updates require teams to develop deployment strategies, which they often lack the staff or resources to do, especially at smaller organizations, making them more reliant on a single vendor.

While there may be an incremental benefit to an auto-update, namely staying as current as possible, that benefit does not outweigh being stable, available and accurate. This latest outage of Windows machines tells the industry that a shift is needed. It should always be best practice to stage and test updates before deploying them to critical infrastructure.

Prioritizing speed over availability 

Information security teams have long preached that the C-I-A triad represents the foundational pillars of cybersecurity (Confidentiality, Integrity, Availability). Unfortunately, the industry focuses on “staying one step ahead of the bad guys” and emphasizes the ‘C,’ often at the expense of the ‘I’ and ‘A.’ There is also, at times, an unhealthy tension between technology and security teams, with technology wanting exclusive ownership of the ‘I’ and ‘A’ while security teams focus on the ‘C.’ 

Instead, it is essential that vendors, as well as the CIOs and CISOs of their customers, carefully balance each letter of the triad. CISOs need to care more about the ‘I’ and ‘A,’ and CIOs need to care more about the ‘C.’ If there is anything to learn from the CrowdStrike incident, it might be that while speed is important, people need to take more time to ensure that integrity and availability are not compromised. They need to deploy updates at a time of their choosing, and stage and test those updates before deploying them broadly.

SaaS vendors are under a lot of pressure, from customers and security teams alike, to ship new releases as quickly as possible, and sometimes quick updates are necessary for cybersecurity. Fast updates are an essential measure to protect against attackers. When a vendor learns of a threat, an attack technique, or a vulnerability being exploited, a quick update is sometimes the best remedy. If that update arrives before the attack, it can stop the attack outright.

But rapid release at the expense of availability and integrity comes at a cost. Speed conflicts directly with the safety measures expected from enterprise software: people expect rigorous QA and gradual rollout, and both inherently slow updates down. However fast and easy they are to deploy, confidentiality improvements can’t override the assurance that an update is accurate (integrity) and stable (availability).

An alternative approach going forward

Striking the balance between speed and integrity is very difficult. However, choosing to perform manual updates on critical systems puts customers in control of how and when their systems are updated. Change is the father of uncertainty, so organizations and security teams need to commit to doing more to verify that changes and updates will have only a positive impact. Manual updates might not be possible for everything, but they should at least be in place for critical systems.
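One way to picture “manual for critical systems, automatic for the rest” is as a simple policy check. The sketch below is illustrative only: the `Host` record, its `critical` flag and the set of approved update IDs are assumptions, standing in for whatever asset inventory and change-approval process an organization already has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical inventory record for a managed machine."""
    name: str
    critical: bool


def may_update(host, update_id, approved_updates):
    """Decide whether `host` may install `update_id` right now.

    Non-critical hosts take updates automatically. Critical hosts only take
    updates that someone has explicitly approved, for example after the
    update has been staged and tested elsewhere.
    """
    if not host.critical:
        return True
    return update_id in approved_updates
```

Under this policy a laptop picks up a new update immediately, while a critical server waits until that update’s ID appears in the approved set.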

As security professionals reflect on the lessons from the CrowdStrike incident, they must collectively prioritize a balanced approach to cybersecurity, one that harmonizes innovation with trust, safety and resilience. In the end, the true measure of cybersecurity prowess lies not in how fast things are done, nor in the ability to innovate, but in the capacity to endure. Security leaders must embrace those proven patterns of change management that have served organizations so well in the past, but they also must evolve. 


